Feedforward Neural Network
Neural unit:
$f(z) = \sigma(z)$ is the activation function of the neural unit, where $\sigma$ is the sigmoid function.
other activation functions
An FFNN is a multilayer network with each layer composed of multiple neural units.
There is at least one hidden layer, and every layer's output is the input of the next layer.
suppose we have a network with 3 layers:
$W$ is the weight matrix of a layer;
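As a sketch, the forward pass of such a small network in NumPy (layer sizes, the sigmoid hidden activation, and the softmax output are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2          # illustrative layer sizes
W1 = rng.normal(size=(d_h, d_in))   # input -> hidden weights
b1 = np.zeros(d_h)
W2 = rng.normal(size=(d_out, d_h))  # hidden -> output weights
b2 = np.zeros(d_out)

x = rng.normal(size=d_in)           # a single input vector
h = sigmoid(W1 @ x + b1)            # hidden layer activation
y = softmax(W2 @ h + b2)            # output probability distribution
```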
FFNN for NLP (Classification):
Pooling:
coming back to the classification task:
if we want to classify a test set of $m$ examples:
all input sequences are packed in a matrix $X$ of size $(m \times d)$, where $m$ is the number of examples and $d$ is the number of attributes of $x_{pooled}$.
ex of X Matrix: $X = \begin{bmatrix} x_{pooled_1}^{(1)} & \dots & x_{pooled_d}^{(1)} \\ \vdots & & \vdots \\ x_{pooled_1}^{(m)} & \dots & x_{pooled_d}^{(m)} \end{bmatrix}_{m \times d}$
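Mean pooling and the packing into $X$ can be sketched as follows (the embedding dimension and sentence lengths are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5                               # embedding dimension (assumption)
# three example "sentences" of different lengths, each a (len, d) embedding matrix
sentences = [rng.normal(size=(n, d)) for n in (7, 3, 10)]

# mean pooling: average the token embeddings of each sentence into x_pooled
# stacking the m pooled vectors gives the (m x d) matrix X
X = np.stack([s.mean(axis=0) for s in sentences])
```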
Feedforward Neural LM
assuming the words are independent:
so for a sequence of words the probability is:
Because NLMs represent words by their embeddings rather than by word identity as in n-gram LMs
- we need to minimize the loss function.
(Not required for the exam)
BACKPROPAGATION:
we need to compute the gradient of the loss function with respect to the parameters of the network: $\nabla_{\omega}L = \left(\frac{\partial L}{\partial \omega_1}, \frac{\partial L}{\partial \omega_2}, \dots, \frac{\partial L}{\partial \omega_n}\right)$
this process is repeated until the loss function is minimized or until the number of epochs of training is reached.
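The gradient-descent loop can be sketched on a toy least-squares loss (the data, learning rate, and number of epochs are illustrative assumptions):

```python
import numpy as np

# toy problem: minimize L(w) = ||Xw - y||^2 by gradient descent
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.01                           # learning rate (assumption)
for epoch in range(500):            # repeat until convergence / epoch limit
    grad = 2 * X.T @ (X @ w - y)    # gradient of the loss w.r.t. the parameters
    w -= lr * grad                  # update step in the negative gradient direction
```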
hidden state is computed as:
in fact, $h$ is a lossy summary of the past.
hidden state vector is updated at each time step;
so after a certain number of time steps, $h$ will no longer contain information about the beginning of the sequence.
A schematic representation of the RNN:
Inference with an RNN as LM
similar to a feedforward neural network.
the output $y_t$ is a probability distribution over the vocabulary of the possible next word.
given an input with dimension $d_{in}$
hidden layer with dimension $d_{h}$
output is a vector with dimension $d_{out}$.
$h_t = f(Uh_{t-1} + Wx_t)$
$y_t = f(Vh_t)$
RNN language model processing at each time step:
use the embedding matrix $E$ to retrieve the embedding of the current word $x_t$;
combine it with the previous hidden state $h_{t-1}$ to compute the new hidden state $h_t$;
generate output layer from the hidden state $h_t$;
compute the probability distribution over the vocabulary of the possible next word;
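The per-time-step process above can be sketched with NumPy (all sizes and the word-id sequence are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
vocab, d_emb, d_h = 10, 4, 6        # illustrative sizes
E = rng.normal(size=(d_emb, vocab)) # embedding matrix (columns are word vectors)
W = rng.normal(size=(d_h, d_emb))   # input -> hidden weights
U = rng.normal(size=(d_h, d_h))     # hidden -> hidden (recurrence) weights
V = rng.normal(size=(vocab, d_h))   # hidden -> output weights

h = np.zeros(d_h)
for w_id in [1, 4, 7]:              # a toy input word-id sequence
    e = E[:, w_id]                  # 1) look up the embedding of the current word
    h = np.tanh(U @ h + W @ e)      # 2) combine with previous hidden state h_{t-1}
    y = softmax(V @ h)              # 3)-4) distribution over the next word
```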
Training process:
this is a self-supervised learning approach.
so the training data is unlabeled.
the label is the next word in the input sequence, to be compared with the output of the network.
compute the loss function and minimize it with backpropagation (through time).
using Teacher forcing: the input at time $t$ is the true label from time $t-1$ (the ground-truth next word), rather than the model's own prediction.
Weight tying:
Use the same matrix as embedding matrix and output weight matrix ($V = E^T$).
$e_t = Ex_t$
$h_t = f(Uh_{t-1} + We_t)$
$y_t = softmax(E^Th_t)$
- this is useful because:
- E and V are trained to do the same thing
- E provides an embedding for each input word
- V provides an embedding for all the possible next words
- Using $V = E^T$ we use a single set of embedding weights for both the input and output layers.
- reduces the number of parameters to train.
- improves the performance of the model.
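A minimal sketch of weight tying, assuming $d_{emb} = d_h$ so that $E^T h_t$ is well-defined (sizes and the word-id sequence are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
vocab, d_h = 8, 5                   # illustrative sizes; here d_emb == d_h
E = rng.normal(size=(d_h, vocab))   # single shared embedding matrix
U = rng.normal(size=(d_h, d_h))
W = rng.normal(size=(d_h, d_h))

h = np.zeros(d_h)
for w_id in [2, 5]:                 # a toy input word-id sequence
    e = E[:, w_id]                  # input embedding:  e_t = E x_t
    h = np.tanh(U @ h + W @ e)      # recurrence
    y = softmax(E.T @ h)            # tied output layer: y_t = softmax(E^T h_t)
```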
RNN for Sequence classification
perform text classification as:
Given a sequence of words, predict a single label.
RNNs are difficult to train because of the vanishing and exploding gradient problems; they lose information about the beginning of the sequence.
why is the LSTM introduced?
to solve the problem of vanishing gradient and exploding gradient.
To manage the context, LSTMs use gates.
Gates are neural network layers that control the flow of information.
Gates are composed of
a feedforward layer;
a sigmoid neural net layer;
a pointwise multiplication operation.
combining the pointwise multiplication with the sigmoid layer makes the gate act as a soft mask: the sigmoid outputs values in $[0, 1]$ that decide how much of each component passes through.
Gates:
$f_t = \sigma(U_f h_{t-1} + W_f x_t)$ (forget gate)
Modified context vector: $k_t = c_{t-1} \odot f_t$ (what to delete from the previous context vector)
$g_t$ -> candidate state: selects what new information to add from the current input and the previous hidden state.
input gate: $i_t = \sigma(U_i h_{t-1} + W_i x_t)$
add gate: selects what information to add to the context vector from the candidate state: $j_t = g_t \odot i_t$, giving $c_t = k_t + j_t$.
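The gate computations can be sketched as follows (sizes are illustrative; biases are omitted, matching the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
d_in, d_h = 3, 4                            # illustrative sizes
# one (U, W) weight pair per gate: forget, input, output, plus the candidate
Uf, Ui, Uo, Ug = (rng.normal(size=(d_h, d_h)) for _ in range(4))
Wf, Wi, Wo, Wg = (rng.normal(size=(d_h, d_in)) for _ in range(4))

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(3, d_in)):        # a toy input sequence
    f = sigmoid(Uf @ h + Wf @ x)            # forget gate: what to erase from context
    i = sigmoid(Ui @ h + Wi @ x)            # input gate: what new info to keep
    o = sigmoid(Uo @ h + Wo @ x)            # output gate
    g = np.tanh(Ug @ h + Wg @ x)            # candidate state
    c = c * f + g * i                       # new context vector c_t = k_t + j_t
    h = o * np.tanh(c)                      # new hidden state
```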
Common RNN NLP Architectures:
a) Sequence labeling (x1 -> y1, x2 -> y2, ...);
b) Sequence classification (x1, x2, ... -> y);
c) language modeling (x1 -> x2, x2 -> x3, ...);
d) encoder-decoder (x1, x2, ... -> y1, y2, ...);
Encoder-Decoder model with RNNs:
task can be solved with encoder-decoder architecture:
Encoder (can be LSTMs, CNN, Transformer):
RNN language modeling for each timestep $t$:
Thus giving:
we get:
$p(y|x) = p(y_1|x)\,p(y_2|x, y_1)\,p(y_3|x, y_1, y_2) \dots p(y_m|x, y_1, \dots, y_{m-1}) = \prod_{t=1}^{m} p(y_t|x, y_1, \dots, y_{t-1})$
RNNs have the problem that, over long-term dependencies, the influence of the source sentence on the target sentence fades.
Introduction of attention mechanism to solve this problem.
allowing decoder to look at all the source words at each step of the decoding process.
I.e. decoder get info from all hidden states of encoder, not just the last one.
idea:
create the context vector $c_t$ as a single fixed-length vector, taking a weighted sum of all the hidden states of the encoder.
the weights select the relevant parts of the source sentence as the decoder generates tokens of the target sentence.
the weights over the encoder hidden states are different for each decoder token -> the context vector is dynamically derived at each step of decoding from the hidden states of the encoder.
computing $c_i$ considers:
how much to focus on each encoder hidden state $he_j$;
how relevant each encoder hidden state is to the current decoder hidden state $hd_{i-1}$.
simplest score function (dot-product attention (degree of similarity)):
$score(he_{j}, hd_{i-1}) = he_{j} \cdot hd_{i-1}$ -> measures the similarity between the $j$-th encoder hidden state and the previous decoder hidden state.
vector of scores: describes how relevant each encoder hidden state is to the current decoder hidden state.
softmax function:
context vector:
$c_i = \sum_{j=1}^{n} \alpha_{ij} * he_{j}$
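A minimal sketch of dot-product attention for one decoder step (the number of encoder states and their dimension are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 4                          # n encoder states of dimension d (assumption)
he = rng.normal(size=(n, d))         # encoder hidden states he_1 .. he_n
hd = rng.normal(size=d)              # previous decoder hidden state hd_{i-1}

scores = he @ hd                     # dot-product score for each encoder state
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()          # softmax -> attention weights alpha_ij
c = alpha @ he                       # context vector c_i: weighted sum of he_j
```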
Transformers:
an encoder-decoder architecture.
encoder:
decoder:
Transformer encoder:
consists of a stack of N encoders.
output of each encoder is fed to the next encoder as input.
each encoder is composed of two sublayers:
Self-Attention-Layer:
Used to understand the relationship between different words in a sentence by computing a representation of the sentence that takes into account the relationship between all words.
maps input sequence of tokens $x = (x_1, ..., x_n)$ to sequence of vectors $y = (y_1, ..., y_n)$ of same length.
to generate the output $y_m$ the model has access only to the input tokens up to position $m$ ($x_1, ..., x_m$).
this ensures we create an LM that can be used for auto-regressive generation, i.e., generating one token at a time.
Self-Attention-mechanism
provides a way to compare a word of interest to the other words in the same sentence, to determine their relevance in the current context.
$score(x_i, x_j) = x_i \cdot x_j$
the greater the dot product, the more similar the two words are.
Score is normalized using softmax function to obtain a probability distribution over all words in the sentence.
$\alpha_{ij} = softmax(score(x_i, x_j)) = \frac{exp(score(x_i, x_j))}{\sum_{k=1}^{i} exp(score(x_i, x_k))}, \forall j \leq i$;
the probability distribution is used to compute a weighted sum of the input sequence.
$y_i = \sum_{j \leq i} \alpha_{ij} \, x_j$
idea:
compute a representation of the input sequence by computing a weighted sum of the input sequence.
each input token is associated with three vectors:
Introducing weight matrices $W_q$, $W_k$ and $W_v$ each of this are used to compute the query, key and value vectors for each input $x_i$.
$W_v$ has dimension $d_{model} \times d_v$;
query vector:
key vector:
value vector:
used to compute the weighted sum of the input sequence, i.e., the output for the current focus of attention.
$v_i = W_v * x_i$, the value vector for the i-th input token.
score function:
Attention is quadratic in the length of the input sequence.
softmax function:
the entire process is parallelized by packing the input sequence (embeddings) of N tokens into a single matrix $X \in \mathbb{R}^{N \times d_{model}}$.
$Q = X * W_q$
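The packed computation can be sketched as causal scaled dot-product attention (the $1/\sqrt{d_k}$ scaling and the causal mask follow the standard Transformer; all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
N, d_model, d_k = 4, 6, 3            # illustrative sizes
X = rng.normal(size=(N, d_model))    # N token embeddings packed in one matrix
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values for all tokens at once
scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product scores (N x N)
mask = np.triu(np.ones((N, N), dtype=bool), k=1)
scores[mask] = -np.inf               # causal mask: token i only attends to j <= i
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True) # row-wise softmax -> attention weights
Y = A @ V                            # output sequence y_1 .. y_N
```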
Transformer Blocks:
pass it through the self-attention layer to obtain a sequence of vectors $SelfAttention(x)$;
the input sequence $x$ is added to $SelfAttention(x)$ through a residual connection.
the residual connection is used to avoid losing input information.
after the residual connection, a normalization layer is applied.
$LayerNorm(SelfAttention(x) + x)$
$LayerNorm = \gamma * \frac{x - \mu}{\sigma} + \beta$
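A minimal sketch of the LayerNorm formula above ($\gamma$, $\beta$, and the $\epsilon$ added for numerical stability are illustrative choices):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize the vector to zero mean / unit variance, then scale and shift
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])   # a toy activation vector
y = layer_norm(x)
```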
multi-head attention:
the outputs of the heads are concatenated and multiplied by a weight matrix $W^O$ to obtain the final output $y$ of the multi-head attention layer.
Positional Encoding:
since we feed the input sequence to the self-attention layer in parallel, the self-attention layer is not able to capture the order of the input sequence.
to capture the order of the input sequence, a positional encoding is added to the input sequence.
the positional encoding is a vector with the same dimension as the input embeddings.
each element of the positional encoding is a function of the position of the token in the input sequence.
(Knowing the formulas is not required for the exam.)
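For reference, a sketch of one common choice, the sinusoidal positional encoding of the original Transformer (an assumption; the notes do not fix a specific function):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / 10000 ** (i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)      # even dimensions
    pe[:, 1::2] = np.cos(angle)      # odd dimensions
    return pe

PE = positional_encoding(10, 8)      # one d_model-dim vector per position
```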
Basic idea of BERT: (Bidirectional Encoder Representations from Transformers)
ex: Consider the following two sentences:
A context-free embedding model such as word2vec gives the same embedding for the word 'Python' in both sentences.
BERT gives different embeddings for the word 'Python' based on the context.